CUBE CONNECT Edition Help

CUBE 7 Cluster Operation

As with previous versions, each machine must be able to reference the shared directory holding the model files in a consistent manner, whether via a mapped drive or a UNC path. The shared directory, and the share that exposes it, must allow the model files to be read and written.

Cluster Manager

On each server host, the ClusterManager.exe executable must be running. By default it listens for incoming connections on TCP port 57364, so the host firewall must allow incoming connections on that port. The Cluster Managers to be used are specified in the Pilot CLUSTER statement, which gives the connection details for each host and the number of slaves to start on it.
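
Before a run, it can help to confirm from the master machine that each host's Cluster Manager port is reachable through its firewall. The following minimal Python sketch just attempts a TCP connection. The helper name cluster_manager_reachable is hypothetical, and a successful connection only shows that something is listening on the port, not that it is a Cluster Manager.

```python
import socket

DEFAULT_CLUSTER_MANAGER_PORT = 57364  # default listening port of ClusterManager.exe


def cluster_manager_reachable(host: str, port: int = DEFAULT_CLUSTER_MANAGER_PORT,
                              timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it only checks that the port accepts connections
    (for example, that the firewall is open); it does not speak the
    Cluster Manager protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or name could not be resolved
        return False


if __name__ == "__main__":
    for host in ["localhost"]:  # replace with the hosts named in SERVER=
        state = "reachable" if cluster_manager_reachable(host) else "NOT reachable"
        print(f"{host}:{DEFAULT_CLUSTER_MANAGER_PORT} is {state}")
```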

This statement is processed only once per run, so all host and slave configuration must be specified in it.

The format of the control statement is:

CLUSTER [RUNID=text]
        SERVER="host_spec" NUMNODES=n [USERNAME="username" PASSWORD="password"]
        SERVER=…
        [WAIT=n] [NOHIDE=?]

The elements of the control statement:

  • RUNID – optional; specifies an identifying name for the run. At present this is not used.
  • SERVER – one or more host definitions, which may include localhost if the local machine is suitable for running slaves. Each host definition is any resolvable host name or IP address of a machine running a Cluster Manager process.
    • PORT – port number of the Cluster Manager to connect to (57364 by default).
    • NUMNODES – number of slave nodes to start on the specified host for this run. Slaves are not shared with other runs.
    • USERNAME, PASSWORD – username and password of the account on the remote host that the slaves will run as. These are only used if the remote Cluster Manager is operating in service mode. Extra constraints apply in this case: mapped drive letters cannot be used, so UNC paths are required, and each specified account on each host must have read/write access to the shared directory.
  • WAIT – maximum time, in seconds, to wait for notification that all requested slaves have started.
  • NOHIDE – by default, all slaves run without visible windows. For debugging, so that model progress can be observed directly, set this flag to true and each slave will display in its own window.
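
To make the NUMNODES total and the service-mode path constraint concrete, here is a minimal Python sketch that checks a planned CLUSTER configuration before it is written into a script. ServerSpec, total_slaves and check_cluster_config are hypothetical and not part of CUBE. The sketch assumes that supplying USERNAME/PASSWORD means the remote Cluster Manager runs in service mode, and treats any path not starting with two backslashes as a non-UNC (mapped drive) path; the host names and credentials are made-up examples.

```python
from dataclasses import dataclass
from typing import List, Optional

DEFAULT_PORT = 57364  # default Cluster Manager port


@dataclass
class ServerSpec:
    """One SERVER= entry of a CLUSTER statement (hypothetical model)."""
    host: str                       # resolvable host name or IP address
    numnodes: int                   # NUMNODES: slaves to start on this host
    port: int = DEFAULT_PORT        # PORT: Cluster Manager port
    username: Optional[str] = None  # USERNAME: only used in service mode
    password: Optional[str] = None  # PASSWORD: only used in service mode


def total_slaves(servers: List[ServerSpec]) -> int:
    """Total number of slaves requested across all hosts."""
    return sum(s.numnodes for s in servers)


def check_cluster_config(servers: List[ServerSpec], model_dir: str) -> List[str]:
    """Return a list of problems found in a planned configuration (empty if none)."""
    problems = []
    if not servers:
        problems.append("at least one SERVER must be given")
    for s in servers:
        if (s.username is None) != (s.password is None):
            problems.append(f"{s.host}: USERNAME and PASSWORD must be given together")
        # Assumed service mode: mapped drives are unavailable, so insist on a UNC path.
        if s.username is not None and not model_dir.startswith("\\\\"):
            problems.append(f"{s.host}: service mode requires a UNC path, got {model_dir!r}")
    return problems


if __name__ == "__main__":
    servers = [ServerSpec("localhost", 2),
               ServerSpec("node2", 4, username="modeller", password="secret")]
    print(total_slaves(servers))  # 6
    print(check_cluster_config(servers, r"M:\model"))            # one problem, for node2
    print(check_cluster_config(servers, r"\\fileserver\model"))  # []
```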

Intrastep

Intra-step is available in the MATRIX and HIGHWAY modules for splitting node processing amongst multiple processing nodes. The main difference from previous versions is that, instead of specifying a PROCESSID name and a list of processor numbers in PROCESSLIST, the MAXPROCESSORS parameter is used: the Intra-step will use as many processors from the pool as are available, up to the value of that parameter.

The COMMPATH and WEIGHTS parameters have been removed: all communication is now handled by the Cluster Manager rather than through files, and a weights configuration is impractical because the user does not select slave IDs. All other parameters remain the same.
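
The MAXPROCESSORS rule amounts to taking the smaller of the parameter value and the number of free processors in the pool. The Python sketch below shows that rule, plus one possible way of dividing a range of work items across the processors obtained. The even, contiguous split and both function names are assumptions for illustration only; this help page does not say how CUBE partitions the work.

```python
from typing import List, Tuple


def processors_to_use(max_processors: int, free_in_pool: int) -> int:
    """MAXPROCESSORS rule: take every free processor in the pool, up to the cap."""
    return max(0, min(max_processors, free_in_pool))


def split_range(first: int, last: int, parts: int) -> List[Tuple[int, int]]:
    """Split the inclusive range first..last into at most `parts` contiguous
    chunks whose sizes differ by at most one (illustrative partitioning only)."""
    total = last - first + 1
    if total <= 0 or parts <= 0:
        return []
    parts = min(parts, total)
    base, extra = divmod(total, parts)
    chunks = []
    start = first
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # the first `extra` chunks get one more item
        chunks.append((start, start + size - 1))
        start += size
    return chunks


if __name__ == "__main__":
    n = processors_to_use(max_processors=4, free_in_pool=3)
    print(n)                      # 3
    print(split_range(1, 10, n))  # [(1, 4), (5, 7), (8, 10)]
```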

Multistep

The Multi-step changes are similar to those for Intra-step: PROCESSID and PROCESSNUM/PROCESSLIST can no longer be specified, because slaves are allocated automatically.

Two new parameters have been added:

  • IDVAR – specifies a variable that will receive the ID string of the assigned slave. This can then be used in a later wait statement (BARRIER).
  • ALIAS – defines a name that can be used in a later wait statement (BARRIER) instead of storing the ID in a variable.

As with Intra-step, COMMPATH has been removed; it no longer serves any purpose because communication is handled by the Cluster Manager instead of through files.
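
Because slaves are now allocated automatically, a script learns which slave received a step only through IDVAR or ALIAS. The following Python sketch models that bookkeeping under stated assumptions: a pool hands out the next free slave ID, records it under an IDVAR variable name and/or an ALIAS name, and later resolves those names back to slave IDs for a wait. SlavePool, its methods, the first-free allocation order and the ID strings are all hypothetical and only mirror the behaviour described above.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional


class SlavePool:
    """Hypothetical model of automatic slave allocation for Multi-step."""

    def __init__(self, slave_ids: Iterable[str]):
        self._free = deque(slave_ids)        # slaves not currently running a step
        self.variables: Dict[str, str] = {}  # IDVAR variable name -> slave ID
        self.aliases: Dict[str, str] = {}    # ALIAS name -> slave ID

    def distribute(self, idvar: Optional[str] = None,
                   alias: Optional[str] = None) -> str:
        """Give a step the next free slave and record its ID via IDVAR and/or ALIAS."""
        if not self._free:
            raise RuntimeError("no free slaves in the pool")
        slave_id = self._free.popleft()
        if idvar is not None:
            self.variables[idvar] = slave_id
        if alias is not None:
            self.aliases[alias] = slave_id
        return slave_id

    def release(self, slave_id: str) -> None:
        """Return a slave to the pool once its step has finished."""
        self._free.append(slave_id)

    def resolve(self, names: Iterable[str]) -> List[str]:
        """Map the names used in a wait (aliases or IDVAR variables) to slave IDs."""
        ids = []
        for name in names:
            if name in self.aliases:
                ids.append(self.aliases[name])
            elif name in self.variables:
                ids.append(self.variables[name])
            else:
                raise KeyError(f"unknown alias or variable: {name}")
        return ids


if __name__ == "__main__":
    pool = SlavePool(["S1", "S2"])
    pool.distribute(idvar="SKIM_ID")
    pool.distribute(alias="ASSIGN")
    print(pool.resolve(["SKIM_ID", "ASSIGN"]))  # ['S1', 'S2']
```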

Waiting

Previously, waiting for Multi-step processes to finish was achieved by waiting for each process's signal files. This is no longer appropriate for Multi-step operations, but file waiting can still be used for other files, such as output files from other processes.

Waiting for Multi-step node completion is now performed via a BARRIER statement:

BARRIER
        IDLIST="ID list string"
        [TIMEOUT=n]
        [CHECKRETURNCODE=?]
        [UPDATEVARS="var mask string"]
        [PRINTFILES=MERGE|MERGESAVE|DELETE|SAVE]
        [DELDISTRIBFILES=?]

IDLIST works like the list in the WAIT4FILES control, but contains the slave IDs previously recorded in the variables specified by IDVAR in the DISTRIBUTEMULTISTEP controls (or the names defined with ALIAS). Each of the other parameters has the same function as the matching WAIT4FILES parameter.
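
Conceptually, BARRIER blocks until every listed slave has finished or TIMEOUT expires. A rough Python analogue uses concurrent.futures.wait, with worker threads standing in for slaves and each step's return value standing in for its return code. The barrier function and the dictionary keyed by slave ID are illustrative assumptions, not how CUBE implements the statement.

```python
import concurrent.futures as cf
from typing import Dict, Iterable, Optional


def barrier(running: Dict[str, cf.Future], idlist: Iterable[str],
            timeout: Optional[float] = None) -> Dict[str, int]:
    """Block until every step in idlist finishes; return {slave ID: return code}.

    `running` maps slave IDs to futures. Raises TimeoutError if any listed
    step is still running after `timeout` seconds (None waits indefinitely).
    """
    ids = list(idlist)
    _, not_done = cf.wait([running[i] for i in ids], timeout=timeout)
    if not_done:
        late = [i for i in ids if running[i] in not_done]
        raise TimeoutError(f"still running after {timeout}s: {late}")
    return {i: running[i].result() for i in ids}


if __name__ == "__main__":
    with cf.ThreadPoolExecutor(max_workers=2) as executor:
        # Each lambda stands in for a sub-process; its result is its return code.
        running = {"S1": executor.submit(lambda: 0),
                   "S2": executor.submit(lambda: 1)}
        print(barrier(running, ["S1", "S2"], timeout=10))  # {'S1': 0, 'S2': 1}
```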

  • CHECKRETURNCODE – fails the run if a process being waited on has a return code >= 2
  • UPDATEVARS – specifies a list of global variables, computed in Pilot and logged from individual programs, that should be merged back from the sub-process run. Any variable whose name begins with an UPDATEVARS name is merged back.
  • PRINTFILES – indicates whether the slaves' print files are saved and/or merged back into the master print file: one of MERGE, MERGESAVE, DELETE or SAVE.
  • DELDISTRIBFILES – controls the disposition of the multi-step temporary files (script files). The default is true.
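
The CHECKRETURNCODE threshold and the UPDATEVARS prefix rule can each be expressed in a few lines. In this Python sketch, check_return_codes and merge_update_vars are hypothetical helpers. The masks are treated as plain name prefixes and compared case-insensitively, which is an assumption: the text above only says that the first part of the name must match. The variable names and values in the usage lines are made-up examples.

```python
from typing import Dict, Iterable


def check_return_codes(codes: Dict[str, int], threshold: int = 2) -> None:
    """CHECKRETURNCODE: fail if any waited-on step returned >= threshold."""
    failed = {sid: rc for sid, rc in codes.items() if rc >= threshold}
    if failed:
        raise RuntimeError(f"sub-process return codes >= {threshold}: {failed}")


def merge_update_vars(master: Dict[str, object], slave: Dict[str, object],
                      masks: Iterable[str]) -> Dict[str, object]:
    """UPDATEVARS: copy slave variables whose names start with any mask."""
    prefixes = [m.upper() for m in masks]  # assumed case-insensitive matching
    merged = dict(master)
    for name, value in slave.items():
        if any(name.upper().startswith(p) for p in prefixes):
            merged[name] = value
    return merged


if __name__ == "__main__":
    check_return_codes({"S1": 0, "S2": 1})  # passes: both below 2
    slave_vars = {"VMT_AM": 1.5e6, "VMT_PM": 1.7e6, "ITER": 3}
    print(merge_update_vars({"ITER": 1}, slave_vars, ["VMT"]))
    # {'ITER': 1, 'VMT_AM': 1500000.0, 'VMT_PM': 1700000.0}
```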

Script location

A new Pilot control, REPORT_SCRIPT_LOCATION, has been added to report the current location of execution to Application Editor:

REPORT_SCRIPT_LOCATION LOCATION_ID="string"